This RMD serves as a guide/lab journal to document my methods for processing and modelling FT-NIRS otolith spectra from juvenile Pacific cod (Gadus macrocephalus) in the Gulf of Alaska. These scans were acquired on a Bruker MPA II with an integrating sphere, 5mm. teflon disk taped over the scanning window and gold transflectance stamp placed over the top of samples. Spectra were collected between 11,500 and 4,000 cm-1 with a resolution of 16 cm-1 and 64 replicate scans.
LPW Scans refer to specimens collected near Little Port Walter (LPW), a NOAA research station located near the southern tip of Baranof Island in Southeast Alaska. Juvenile cod were collected using beach and purse seines from embayments near LPW on July 27-28, 2020. These fish were transferred into outdoor net pens at LPW to be reared in captivity. Specimens were collected approximately weekly (ADD MORE IN HERE LATER)
This segment is adapted from code provided by Dr. Esther Goldstein at NOAA’s AFSC in Seattle. (ADD INFO ON PACKAGES INCLUDED?)
# Install packages not yet installed
packages <- c("caret", "corrplot", "devtools", "dplyr", "gglm", "ggplot2", "gratia", "gridExtra", "knitr", "lubridate", "mdatools", "mgcv", "MuMIn", "opusreader2", "prospectr", "splitstackshape", "tidyr", "viridis")
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
utils::install.packages(pkgs = packages[!installed_packages])
}
invisible(lapply(packages, library, character.only = TRUE)) # load all packages in list
# install packages not on CRAN
if (!require("remotes")) install.packages("remotes")
if (!require("opusreader2")) {
remotes::install_github("spectral-cockpit/opusreader2")
library("opusreader2")
}
# if (!require("simplerspec")) {
# remotes::install_github("philipp-baumann/simplerspec")
# library("simplerspec")
# }
rm(installed_packages, packages) # remove objects from environment
Some notes on metadata variables
specimen)length)weight)structure_weight)read_age) in days representing an
average of multiple reads (agreement <= 5%)side)percent_affected)other_problem)broken)scan_name,
run_number) to indicate separate scans for the same
specimen (LPW specimens scanned in triplicate)file_name)The Bruker MPA outputs FT-NIRS spectra in a .0 format. Prior to
loading spectral data, a metadata .csv (LPW_metadata.csv)
was generated to pair specimens biological data with their respective
FT-NIRS scans.
To facilitate easier .0 file loading, all file names for .0 files should be added to the metadata file. All filenames in a folder can be generated with list.files(). For pulling file names to create a new .csv, you can use:
This output (without index numbers in square brackets) can be copied
and pasted into Excel, then converted into columns using the “Data ->
Text to Columns” function and selecting a “space” delimiter. These
columns can then be transposed into a file_name column.
First I import my metadata (LPW_metadata.csv) and
convert the session_title (i.e. scan #) to a factor. LPW
FT-NIRS scans were taken in triplicate
(NIR_LPW202001202_otoliths,
...otoliths_rescan_1, ...otoliths_rescan_2) to
see if spectra changed after storage. Metadata for LPW contained an
additional entry (..._rescan_3) with some notes and
comments, however I will only be importing the metadata for actual
scans.
# 66 missing from first 2 scans, 3 & 4 have it though (removed from dataset for now, may incorporate back in)
meta_LPW <- read.csv("LPW_metadata.csv") # import metadata
levels(as.factor(as.character(meta_LPW$session_title))) # set session_title to factor (three scans taken at separate times).
## [1] "NIR_LPW202001202_otoliths" "NIR_LPW202001202_otoliths_rescan_1"
## [3] "NIR_LPW202001202_otoliths_rescan_2" "NIR_LPW202001202_otoliths_rescan_3"
meta_LPW <- meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths", "NIR_LPW202001202_otoliths_rescan_1", "NIR_LPW202001202_otoliths_rescan_2", "NIR_LPW202001202_otoliths_rescan_3"), ]
meta_LPW$specimen[meta_LPW$file_name == ""] # check for missing file_names in metadata
## integer(0)
meta_LPW <- meta_LPW[meta_LPW$file_name != "", ] # remove any rows that are missing a file_name in the metadata
# check number of specimens for each scan session (122 specimens, 365 total scans after excluding specimen 66, scan 1)
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_1"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_2"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_3"), ])
## [1] 121
meta_LPW$sample_date <- lubridate::as_date(mdy(meta_LPW$sample_date)) # convert sample date to lubridate format
# generate file paths to import with Opusreader
meta_LPW$file_path <- paste0(getwd(), "/", "/LPW_scans/", meta_LPW$file_name) # generate filepaths for all .0 files
Opusfiles <- as.vector(meta_LPW$file_path) # store filepaths in object
exists <- as.vector(lapply(Opusfiles, file.exists)) # check that I have all my files or else I get an error when I read them in
meta_LPW$exists <- exists # add column to metadata dataframe to confirm file exists in directory
meta_LPW1 <- meta_LPW[meta_LPW$exists == "TRUE", ] # filter the file list and data by otoliths with spectra files
Opusfiles <- as.vector(meta_LPW1$file_path) # from Esther, "I repeated this and wrote over it so I wouldn't have extra files to read in that don't exist and produce an error"
rm(exists)
rm(meta_LPW1)
Once all metadata has been imported into a dataframe with filepaths
for .0 files, FT-NIRS scans can be imported using the
opusreader2 package and read_opus(). First a
single file is read-in to make sure the script works and everything
looks good; then all .0 files are imported using the
Opusfiles object. read_opus() creates a
massive list of lists that includes information about MPA II settings,
metadata generated during scans, specimen names, etc. See
?read_opus() from the opusreader2 package for
more details on the list structure of .0 files.
# read a single file (one measurement) to check that script runs without failure
file <- Opusfiles[1]
data_list <- opusreader2::read_opus(dsn = file) # NA's seem to be introduced due to a timestamp metadata issue, currently using suppressWarnings to hide the annoying output, but this is normal.
rm(data_list)
rm(file)
SPCfiles_nooffset <- suppressWarnings(lapply(Opusfiles, opusreader2::read_opus)) # Read in all files in the Opusfiles list. This gives an error if any file names or paths are wrong.
# uncomment to see additional output from .0 files
# str(SPCfiles_nooffset[[1]]) # check first element
# SPCfiles_nooffset[[1]][[1]]$ab$data # can see spc values this way I think
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters #this has info about what setting was used (here otolith), species, and file name
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters$FC2$parameter_value # species
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters$FD1$parameter_value # unique ID, including project title (LPW2020), experimental designation (01 for experimental, mortalities are numbered 02), species code (202), specimen number (_1_) and scan number (OA1)
# SPCfiles_nooffset[[1]][[1]]$ab$wavenumbers # wavenumbers scanned
# SPCfiles_nooffset[[1]][[1]]$instrument_ref$parameters$INS$parameter_value # instrument name
Once all .0 files with included metadata have been successfully read-in, FT-NIRS scan data can be extracted from the lists.
# extract absorbance measurements, wavenumbers scanned, unique specimen ID, species & instrument name from .0 files list
spectra <- lapply(SPCfiles_nooffset, function(x) x[[1]]$ab$data) # FT-NIRS scan data, absorbance measurements at all 949 wavenumbers
wavenumber <- lapply(SPCfiles_nooffset, function(x) x[[1]]$ab$wavenumbers) # list of all wavenumbers scanned
file_id <- lapply(SPCfiles_nooffset, function(x) x$lab_and_process_param_raw$parameters$FD1$parameter_value) # unique ID in Opus
species <- lapply(SPCfiles_nooffset, function(x) x$lab_and_process_param_raw$parameters$FC2$parameter_value) # species
instrument <- lapply(SPCfiles_nooffset, function(x) x[[1]]$instrument_ref$parameters$INS$parameter_value) # instrument name
# create dataframe out of list
spectra <- lapply(spectra, as.data.frame)
# add metadata to dataframe
for (i in 1:length(spectra)) {
colnames(spectra[[i]]) <- wavenumber[[i]] # assign wavenumbers to columns of absorbance measurements
}
for (i in 1:length(spectra)) {
spectra[[i]]$species <- species[[i]]
}
for (i in 1:length(spectra)) {
spectra[[i]]$file_id <- file_id[[i]]
}
for (i in 1:length(spectra)) {
spectra[[i]]$instrument <- instrument[[i]]
}
for (i in 1:length(spectra)) {
spectra[[i]]$file_path <- Opusfiles[[i]]
}
# extract file_name from file_path: test code on one file to
# try <- spectra[[1]] # specimen 1, scan session 1
# splitstackshape::cSplit(as.data.frame(try$file_path), sep = "/", splitCols = "try$file_path", type.convert = FALSE) %>% select(tail(names(.), 1))
# rm(try)
# create file_name variable for all scans
file_name <- lapply(spectra, function(x) splitstackshape::cSplit(as.data.frame(x$file_path), sep = "/", splitCols = "x$file_path", type.convert = FALSE) %>% select(tail(names(.), 1)))
# file_name[[1]][[1, 1]] # check that first element is correct
# add file_name variable to spectra dataframe
for (i in 1:length(spectra)) {
spectra[[i]]$file_name <- file_name[[i]][[1, 1]]
}
rm(file_id, file_name, instrument, species, wavenumber, Opusfiles, i, SPCfiles_nooffset)
df <- as.data.frame(do.call(rbind, spectra)) # convert spectral list/dataframe to dataframe object
dfmeta_LPW <- dplyr::left_join(meta_LPW, df, by = c("file_name", "file_path")) # join metadata to spectral data
rm(df)
dfmeta_LPW <- dfmeta_LPW %>% select(-c("exists", "instrument")) # remove exists and instrument columns
dfmeta_LPW$run_number <- as.factor(dfmeta_LPW$run_number)
rm(meta_LPW, spectra)
colnames(dfmeta_LPW) <- as.character(colnames(dfmeta_LPW)) # make sure column names are characters, not numeric
dfmeta_longLPW <- pivot_longer(dfmeta_LPW, cols = -c(1:20)) # pivot absorbance measurements to long format for easy plotting
dfmeta_longLPW <- dfmeta_longLPW %>% rename(., "wavenumber" = "name") # rename name column to wavenumber for clarification
dfmeta_longLPW$wavenumber <- as.numeric(as.character(dfmeta_longLPW$wavenumber)) # change class of wavenumber variable to a numeric
# Plot 7 specimens to see if data loaded properly.
ggplot() +
geom_path(
data = dfmeta_longLPW[dfmeta_longLPW$specimen %in% c(1, 10, 100, 101, 102, 103, 104), ],
aes(x = wavenumber, y = value, color = session_title, group = file_name), linewidth = .5
) +
scale_x_reverse() +
labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1))) +
theme(
axis.text = element_text(size = 10),
axis.text.x = element_text(size = 12, angle = 25),
axis.title = element_text(size = 12),
legend.position = "none",
strip.text = element_text(size = 14),
legend.text = element_text(size = 10),
legend.title = element_text(size = 12),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
) +
facet_wrap(~specimen)
# Save RDS for data transparency and easily enable others to work with dataframe without all these complicated import steps
# saveRDS(dfmeta_LPW, "RDS_dataframes/LPW_dfmeta.RDS")
Some functions useful for exploring data, including quick plotting PCA and seeing impact of different preprocessing filters
# ggplot function to plot spectra, colored by some factor
spec.fig <- function(mydf, color) {
# color <- as.character({{color}})
ggplot(mydf) +
geom_path(aes(x = name, y = value, color = {{ color }}, group = file_name)) +
scale_x_reverse() +
scale_color_viridis() +
labs(y = "Preprocessed absorbance", x = expression(paste("Wavenumber ", cm^-1)))
# theme(axis.text = element_text(size = 16),
# axis.text.x = element_text(size = 12, angle = 25),
# axis.title = element_text(size = 14),
# legend.position = "right",
# strip.text = element_text(size = 14),
# legend.text = element_text(size = 14),
# legend.title = element_text(size = 14),
# panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# panel.background = element_blank(),
# axis.line = element_line(colour = "black")
# )
}
# function to plot results of different savitzkyGolay filter params
sg_plotting <- function(color, m, p, w) {
dftempproc <- as.data.frame(
cbind(
scan_avg[, c(1:20)],
savitzkyGolay(scan_avg[, 21:length(scan_avg)], m = m, p = p, w = w)
)
)
dftempproc_long <- tidyr::pivot_longer(dftempproc, cols = c(21:length(dftempproc)))
dftempproc_long$name <- as.numeric(as.character(dftempproc_long$name))
spec.fig(mydf = dftempproc_long, color = {{ color }}) +
ggtitle(paste("diff = ", {{ m }}, "poly = ", {{ p }}, "window = ", {{ w }}))
}
# function to apply savitzkyGolay filter with selected parameters to dataframe and create new temp_proc and temp_proc_long dataframes for modelling
quickproc <- function(df, m, p, w) {
temp_proc <<- as.data.frame(
cbind(
{{ df }}[, c(1:20)],
savitzkyGolay({{ df }}[, 21:length({{ df }})], m = m, p = p, w = w)
)
)
temp_proc_long <<- tidyr::pivot_longer(temp_proc, cols = c(21:length(temp_proc)))
temp_proc_long$name <<- as.numeric(as.character(temp_proc_long$name))
}
Preliminary spectra and PCA plots were explored to see whether spectra of different scans were similar enough to warrant combining for analyses. Previous research has shown FT-NIRS spectra can change quite a bit over time and the triplicate scans of LPW otoliths were collected over nearly a year between the first and third scans.
# Plot FT-NIRS scans, all 4 runs, colored by run_number
ggplot(dfmeta_longLPW) +
geom_path(aes(x = wavenumber, y = value,
color = run_number, group = file_name),
alpha = 0.5) +
labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
scale_color_viridis(discrete = T) +
scale_x_reverse() + # FT-NIRS spectra often displayed with reversed x-axis for otoliths - lower wavenumbers generally more informative and are more easily seen away from axes
ggtitle("LPW FT-NIRS Spectra, All Scan Sessions")
There appear to be slightly different values for each scan - First I look at pairings of scans
plot1 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(1, 2), ]) +
geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
scale_color_viridis(discrete = T) +
scale_x_reverse() +
ggtitle("LPW FT-NIRS Spectra, Scans 1 & 2")
# Scans 1 & 3
plot2 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(1, 3), ]) +
geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
scale_color_viridis(discrete = T) +
scale_x_reverse() +
ggtitle("LPW FT-NIRS Spectra, Scans 1 & 3")
# Scans 2 & 3
plot3 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(2, 3), ]) +
geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
scale_color_viridis(discrete = T) +
scale_x_reverse() +
ggtitle("LPW FT-NIRS Spectra, Scans 2 & 3")
# Scans 2 & 3
plot4 <- ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(3, 4), ]) +
geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) +
labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
scale_color_viridis(discrete = T) +
scale_x_reverse() +
ggtitle("LPW FT-NIRS Spectra, Scans 3 & 4")
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
rm(plot1, plot2, plot3, plot4)
Scan 1 looks quite different than 2, 3 & 4. These should be triplicate scans with identical values for each of the 121 specimens. To confirm, I extract/plot the first 2 principal components for all FT-NIRS wavenumbers (indices 21:ncol(dfmeta_LPW)) and color points by run number.
pca_all_LPW <- pca(dfmeta_LPW[21:ncol(dfmeta_LPW)], scale = T)
pcs <- as.data.frame(cbind(pc2 = pca_all_LPW$calres$scores[, 2], run_number = dfmeta_LPW$run_number)) # extract scores for PC2
pcs <- cbind(pc1 = pca_all_LPW$calres$scores[, 1], pcs) # extract scores for PC1
# plot PC1 & PC2, colored by run_number
ggplot(pcs) +
geom_point(aes(x = pc1, y = pc2, color = as.factor(run_number)), size = 3) +
labs(x = paste("PC1 (",
round(pca_all_LPW$calres$expvar[1], digits = 3), # variance explained by PC1
"% var. explained )"),
y = paste("PC2 (",
round(pca_all_LPW$calres$expvar[2], digits = 3), # variance explained by PC2
"% var. explained )"),
color = "Run Number"
) +
scale_color_viridis(discrete = T)
rm(pca_all_LPW, pcs, dfmeta_longLPW)
Scan #1 confirmed to be quite different. Something appears to have changed between scan periods, however it is unknown exactly what occurred. Approximately 10 months elapsed between scan 1 & scan 2. All future analysis will combine absorbance measurements from scans 2, 3 & 4 and average.
Combining/averaging LPW FT-NRIS scans 2, 3 & 4 preprocessing & filtering spectra >7500 cm-1. Combining scans 2, 3 & 4 into a single average scan. Scan 1 is being ignored due to PCA above
Spectral preprocessing serves to remove unwanted noise from absorbance measurements, filter out NIR backscatter, and can improve model fits by increasing variability of absorbance measurements at wavenumbers between specimens. Initial preprocessing done with prospectr and a savitzkygolay filter. Will stick with one method and explore others later if time allows.
Removing noisy stretch of spectra > 7500 cm-1 - similar approach done by Matta et al. in Walleye Pollock daily age paper and seems typical for otolith FT-NIRS work.
scan_2 <- dfmeta_LPW %>% dplyr::filter(run_number == 2)
scan_3 <- dfmeta_LPW %>% dplyr::filter(run_number == 3)
scan_4 <- dfmeta_LPW %>% dplyr::filter(run_number == 4)
scan_avg <- bind_cols(NULL, scan_2[, 1:20])
scan_avg <- bind_cols(scan_avg, (scan_2[, 21:ncol(scan_2)] + scan_3[, 21:ncol(scan_3)] + scan_4[, 21:ncol(scan_4)]) / 3)
rm(scan_2, scan_3, scan_4, dfmeta_LPW)
# saveRDS(scan_avg, "RDS_dataframes/LPW_scan_avg_unproc.RDS")
# preprocess
scan_avg_filter <- cbind(scan_avg[, c(1:20)], savitzkyGolay(scan_avg[, 21:length(scan_avg)], m = 1, p = 3, w = 17))
scan_avg_filter <- scan_avg_filter[, -c(21:517)] # removing the noisy stretch of spectra (>7500, up to 7504)
rm(scan_avg)
# saveRDS(scan_avg_filter, "RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
# dfmeta_LPW <- readRDS("RDS_dataframes/LPW_dfmeta.RDS")
# scan_avg <- readRDS("RDS_dataframes/LPW_scan_avg_unproc.RDS")
# scan_avg_filter <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter")
Look for outliers in first 2 PC’s, colored by age
scan_avg_filter <- scan_avg_filter[complete.cases(scan_avg_filter$read_age), ] # filter only aged specimens
pca_temp <- pca(scan_avg_filter[, 21:ncol(scan_avg_filter), ])
scan_avg_filter$PC1 <- pca_temp$res$cal$scores[, 1]
scan_avg_filter$PC2 <- pca_temp$res$cal$scores[, 2]
rm(pca_temp)
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
library(plotly)
ggplotly(
ggplot() +
geom_point(data = scan_avg_filter[!c(scan_avg_filter$specimen %in% rm_specimens), ],
aes(x = PC1, y = PC2, color = read_age), size = 5
) +
geom_point(data = scan_avg_filter[c(scan_avg_filter$specimen %in% rm_specimens), ],
aes(x = PC1, y = PC2, color = read_age), shape = 17, size = 7
) +
scale_color_viridis() +
geom_point()
)
rm(rm_specimens)
Removing above identified outliers
scan_avg_filter <- scan_avg_filter %>%
dplyr::filter(read_age != 188, read_age != 181, read_age != 175, read_age != 161, read_age != 147, read_age != 135, specimen != 65)
# store metrics from each fold and each model type
RMSE.age <- list()
r2.age <- list()
AIC.age <- list()
AICc.age <- list()
mod.summary <- data.frame(
model = c("lm", "gam", "pls","pls_vip"),
r2 = 1:4,
RMSE = 1:4,
AIC = c(0,0,NA,NA),
AICc = c(0,0,NA,NA))
# 10 fold CV split - create folds
set.seed(6)
splits <- caret::createFolds(scan_avg_filter$read_age, k = 10, list = TRUE, returnTrain = FALSE)
# extract PC's for each calibration set, create test sets with ages and spectra
cal <- list()
test <- list()
for (i in 1:10) {
# calibration set and PC's
pc.mod <- preProcess(scan_avg_filter[-splits[[i]], -c(1:20)], method = "pca", thresh = 0.95, pcaComp = 10)
pc.cal <- predict(pc.mod, scan_avg_filter[-splits[[i]], -c(1:20)])
pc.cal <- cbind(pc.cal, scan_avg_filter[-splits[[i]], ])
cal[[i]] <- pc.cal
rm(pc.cal)
# test sets
pc.test <- predict(pc.mod, scan_avg_filter[splits[[i]], -c(1:20)])
pc.test <- cbind(pc.test, scan_avg_filter[splits[[i]], ])
test[[i]] <- pc.test
rm(pc.test, pc.mod)
}
# determine which PC's to include via step & AIC selection
mod.sel <- list()
for (i in 1:10) {
pctest <- cal[[i]]
temp <- step(lm(data = pctest[-splits[[i]], ], read_age ~ PC1 + PC2 + PC3 + PC4))
mod.sel[[i]] <- rownames(summary(temp)$coef)
rm(pctest, temp)
}
knitr::kable(table(unlist(mod.sel)), align = "c")
| Var1 | Freq |
|---|---|
| (Intercept) | 10 |
| PC1 | 10 |
| PC2 | 6 |
| PC3 | 10 |
| PC4 | 8 |
rm(mod.sel)
PC 2 will be excluded for now.
# lm() with 10-fold CV
mods.lm <- list()
for (i in 1:10) {
calibrate <- cal[[i]]
testing <- test[[i]]
mod <- lm(data = calibrate, read_age ~ PC1 + PC3 + PC4)
# RMSE.age$lm.cal[i] <- caret::RMSE(pred = mod$fitted.values, obs = calibrate[, "read_age"])
preds <- predict(mod, newdata = testing)
RMSE.age$lm.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
# r2.age$lm.cal[i] <- summary(mod)$r.squared
RSS <- sum((testing$read_age - preds)^2)
TSS <- sum((testing$read_age - mean(testing$read_age))^2)
r2.age$lm.test[i] <- 1 - (RSS / TSS)
AIC.age$lm[i] <- AIC(mod)
AICc.age$lm[i] <- AICc(mod)
mods.lm[[i]] <- mod
rm(mod, preds, RSS, TSS)
}
mod.summary[1,2] <- round(mean(r2.age$lm.test), 3)
mod.summary[1,3] <- round(mean(RMSE.age$lm.test), 3)
mod.summary[1,4] <- round(mean(AIC.age$lm) , 3)
mod.summary[1,5] <- round(mean(AICc.age$lm) , 3)
# RMSE.age$lm.cal <- mean(RMSE.age$lm.cal)
# r2.age$lm.cal <- mean(r2.age$lm.cal)
# GAM with 10 fold CV, select = T allows PCs to be penalized and effectively removed from model if appropriate.
mods.GAM <- list()
for (i in 1:10) {
calibrate <- cal[[i]]
testing <- test[[i]]
mod <- gam(data = calibrate, read_age ~ s(PC1) + s(PC2) + s(PC3) + s(PC4), method = "REML")
# Extract AIC, AICc, RMSE (cal & test) & r2 (cal & test)
# RMSE.age$GAM.cal[i] <- caret::RMSE(pred = mod$fitted.values, obs = calibrate[, "read_age"])
preds <- predict(mod, newdata = testing)
RMSE.age$GAM.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
# r2.age$gam.cal[i] <- summary(mod)$r.sq
RSS <- sum((testing$read_age - preds)^2)
TSS <- sum((testing$read_age - mean(testing$read_age))^2)
r2.age$gam.test[i] <- 1 - (RSS / TSS)
AIC.age$gam[i] <- AIC(mod)
AICc.age$gam[i] <- AICc(mod)
mods.GAM[[i]] <- mod
rm(RSS, TSS, preds, i, mod, calibrate, testing)
}
mod.summary[2,2] <- round(mean(r2.age$gam.test), 3)
mod.summary[2,3] <- round(mean(RMSE.age$GAM.test), 3)
mod.summary[2,4] <- round(mean(AIC.age$gam), 3)
mod.summary[2,5] <- round(mean(AICc.age$gam), 3)
rm(AIC.age, AICc.age)
mods.pls <- list()
mods.vip <- list()
# out.mods <- list()
# pls, no variable selection & VIP > 1
for (i in 1:10) {
calibrate <- scan_avg_filter[-splits[[i]], ]
testing <- scan_avg_filter[splits[[i]], ]
mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "read_age"],
ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
info = "Age Prediction Model",
x.test = testing[, 31:ncol(testing)],
y.test = testing[, "read_age"]
)
ncomp <- mod$ncomp.selected
RMSE.age$pls.test[i] <- mod$testres$rmse[[ncomp]]
r2.age$pls.test[i] <- mod$testres$r2[[ncomp]]
mods.pls[[i]] <- mod
mods.pls[[1]]$ncomp.selected
# r2.age$pls.cal[i] <- mod$calres$r2[[ncomp]]
# RMSE.age$pls.cal[i] <- mod$calres$rmse[[ncomp]]
# # Exclude outliers from above
# outliers <- which(categorize(mod, mod$res$cal) == "extreme")
# mod.out <- pls(calibrate[-outliers, 31:ncol(calibrate)], calibrate[-outliers, "read_age"],
# ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
# info = "Age Prediction Model",
# x.test = testing[-outliers, 31:ncol(testing)],
# y.test = testing[-outliers, "read_age"]
# )
# ncomp <- mod.out$ncomp.selected
# # RMSE.age$pls.cal[i] <- mod$calres$rmse[[ncomp]]
# RMSE.age$pls.outliers[i] <- mod.out$testres$rmse[[ncomp]]
# # r2.age$pls.cal[i] <- mod$calres$r2[[ncomp]]
# r2.age$pls.outliers[i] <- mod.out$testres$r2[[ncomp]]
# out.mods[[i]] <- mod.out
# VIP < 1 excluded
vip <- as.data.frame(vipscores(mod))
mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "read_age"],
ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
info = "Age Prediction Model",
x.test = testing[, 31:ncol(testing)],
y.test = testing[, "read_age"],
exclcols = vip$V1 < 1
)
ncomp <- mod$ncomp.selected
r2.age$vip.test[i] <- mod$testres$r2[[ncomp]]
mods.vip[[i]] <- mod
RMSE.age$vip.test[i] <- mod$testres$rmse[[ncomp]]
# r2.age$vip.cal[i] <- mod$calres$r2[[ncomp]]
# RMSE.age$vip.cal[i] <- mod$calres$rmse[[ncomp]]
rm(calibrate, mod, ncomp, testing, i, vip)
}
# RMSE.age$pls.cal <- mean(RMSE.age$pls.cal)
# RMSE.age$pls.outliers <- mean(RMSE.age$pls.outliers)
# r2.age$pls.cal <- mean(r2.age$pls.cal)
# r2.age$pls.outliers <- mean(r2.age$pls.outliers)
# RMSE.age$vip.cal <- mean(RMSE.age$vip.cal)
mod.summary[3, 2] <- round(mean(r2.age$pls.test), 3)
mod.summary[3, 3] <- round(mean(RMSE.age$pls.test), 3)
mod.summary[4, 2] <- round(mean(r2.age$vip.test), 3)
mod.summary[4, 3] <- round(mean(RMSE.age$vip.test), 3)
# sapply(mods.vip,plot)
# sapply(mods.pls,plot)
rm(r2.age, RMSE.age)
knitr::kable(mod.summary, align = "c")
| model | r2 | RMSE | AIC | AICc |
|---|---|---|---|---|
| lm | 0.766 | 11.865 | 383.176 | 384.585 |
| gam | 0.753 | 11.779 | 370.186 | 380.351 |
| pls | 0.805 | 10.345 | NA | NA |
| pls_vip | 0.785 | 10.324 | NA | NA |
Show high correlation of sequential wavenumbers. 100 shown here.
corrplot::corrplot(cor(scan_avg_filter[, 150:250]), tl.pos = "n", cl.cex = 2)
# Using test/cal split # 5 for model diagnostics
# ggplot version of lm diagnostics
gglm(mods.lm[[5]], theme = theme_bw(),
theme(plot.title = element_text(size = 22),
axis.title = element_text(size = 20),
axis.text = element_text(size = 18))
)
# Using test/cal split # 5 for model diagnostics
# GAM diagnostics
appraise(mods.GAM[[5]]) &
theme_bw() &
theme(plot.title = element_text(size = 15),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
# GAM partial effects
draw(mods.GAM[[5]], residuals = T) &
theme_bw() &
theme(plot.title = element_text(size = 15),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
Demonstrate fish got larger over time
ggplot(scan_avg_filter, aes(x = sample_date, y = length)) +
geom_point(size = 3) +
geom_smooth(method = "lm", col = "black") +
theme(
axis.title = element_text(size = 20),
axis.text = element_text(size = 15)
) +
theme_bw() +
xlab("Sample Date") +
ylab("Fork length (mm)")
## `geom_smooth()` using formula = 'y ~ x'
Colored by length with outliers removed marked as triangles
age_only <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
age_only <- age_only[complete.cases(age_only$read_age), ]
pca_temp <- pca(age_only[,21:ncol(age_only),])
age_only$PC1 <- pca_temp$res$cal$scores[,1]
age_only$PC2 <- pca_temp$res$cal$scores[,2]
rm(pca_temp)
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
library(plotly)
ggplotly(
ggplot() +
geom_point(data = age_only[!c(age_only$specimen %in% rm_specimens),],
aes(x = PC1, y = PC2, color = read_age), size = 5) +
geom_point(data = age_only[c(age_only$specimen %in% rm_specimens),],
aes(x = PC1, y = PC2, color = read_age), shape = 17, size = 7) +
scale_color_viridis()
)
rm(age_only, rm_specimens)
# Create dataframe only with ages contained in metadata
age_only <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
age_only <- age_only[complete.cases(age_only$read_age), ]
age_only <- age_only[complete.cases(age_only$structure_weight), ]
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
age_only <- age_only %>% filter(!specimen %in% rm_specimens)
rm(rm_specimens)
# pca_temp <- pca(scan_avg_filter[,21:ncol(scan_avg_filter),])
# store metrics from each fold and each model type
RMSE.age <- list()
r2.age <- list()
AIC.age <- list()
AICc.age <- list()
mod.lw.summary <- data.frame(
model = c("lm", "gam"),
r2 = 1:2,
RMSE = 1:2,
AIC = c(0,0),
AICc = c(0,0))
set.seed(6)
splits <- caret::createFolds(age_only$read_age, k = 10, list = TRUE, returnTrain = FALSE)
mods.lw.lm <- list()
for (i in 1:10) {
calibrate <- age_only[-splits[[i]],]
testing <- age_only[splits[[i]],]
mod <- lm(data = calibrate, read_age ~ length + structure_weight)
preds <- predict(mod, newdata = testing)
RMSE.age$lm.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
RSS <- sum((testing$read_age - preds)^2)
TSS <- sum((testing$read_age - mean(testing$read_age))^2)
r2.age$lm.test[i] <- 1 - (RSS / TSS)
AIC.age$lm[i] <- AIC(mod)
AICc.age$lm[i] <- AICc(mod)
mods.lw.lm[[i]] <- mod
rm(mod, preds,RSS,TSS)
}
mod.lw.summary[1,2] <- round(mean(r2.age$lm.test),3)
mod.lw.summary[1,3] <- round(mean(RMSE.age$lm.test),3)
mod.lw.summary[1,4] <- round(mean(AIC.age$lm),3)
mod.lw.summary[1,5] <- round(mean(AICc.age$lm),3)
mods.lw.GAM <- list()
for (i in 1:10) {
calibrate <- age_only[-splits[[i]], ]
testing <- age_only[splits[[i]], ]
mod <- gam(data = calibrate, read_age ~ s(length) + s(structure_weight), method = "REML")
# Extract AIC, AICc, RMSE (cal & test) & r2 (cal & test)
# RMSE.age$GAM.cal[i] <- caret::RMSE(pred = mod$fitted.values, obs = calibrate[, "read_age"])
preds <- predict(mod, newdata = testing)
RMSE.age$GAM.test[i] <- caret::RMSE(pred = preds, obs = testing[, "read_age"])
# r2.age$gam.cal[i] <- summary(mod)$r.sq
RSS <- sum((testing$read_age - preds)^2)
TSS <- sum((testing$read_age - mean(testing$read_age))^2)
r2.age$gam.test[i] <- 1 - (RSS / TSS)
AIC.age$gam[i] <- AIC(mod)
AICc.age$gam[i] <- AICc(mod)
mods.lw.GAM[[i]] <- mod
rm(mod, preds, RSS,TSS)
}
mod.lw.summary[2,2] <- round(mean(r2.age$gam.test),3)
mod.lw.summary[2,3] <- round(mean(RMSE.age$GAM.test),3)
mod.lw.summary[2,4] <- round(mean(AIC.age$gam),3)
mod.lw.summary[2,5] <- round(mean(AICc.age$gam),3)
rm(AIC.age,AICc.age, RMSE.age,r2.age)
knitr::kable(mod.lw.summary, align = "c")
| model | r2 | RMSE | AIC | AICc |
|---|---|---|---|---|
| lm | 0.756 | 11.082 | 355.694 | 356.672 |
| gam | 0.760 | 10.833 | 348.363 | 352.259 |
| model | r2 | RMSE | AIC | AICc |
|---|---|---|---|---|
| lm | 0.766 | 11.865 | 383.176 | 384.585 |
| gam | 0.753 | 11.779 | 370.186 | 380.351 |
| pls | 0.805 | 10.345 | NA | NA |
| pls_vip | 0.785 | 10.324 | NA | NA |
| model | r2 | RMSE | AIC | AICc |
|---|---|---|---|---|
| lm | 0.756 | 11.082 | 355.694 | 356.672 |
| gam | 0.760 | 10.833 | 348.363 | 352.259 |
Predict structure weight using FT-NIRS
# Create dataframe only with ages contained in metadata
age_only <- readRDS("RDS_dataframes/LPW_scan_avg_proc_filter.RDS")
age_only <- age_only[complete.cases(age_only$read_age), ]
age_only <- age_only[complete.cases(age_only$structure_weight), ]
rm_specimens <- c("24", "52", "53", "59", "65", "67", "74")
age_only <- age_only %>% filter(!specimen %in% rm_specimens)
rm(rm_specimens)
# pca_temp <- pca(scan_avg_filter[,21:ncol(scan_avg_filter),])
set.seed(6)
splits <- caret::createFolds(age_only$structure_weight, k = 10, list = TRUE, returnTrain = FALSE)
mods.sw.pls <- list()
mods.sw.vip <- list()
RMSE.weight <- list()
r2.weight <- list()
mod.sw.summary <- data.frame(
model = c("pls", "pls_vip"),
r2 = 1:2,
RMSE = 1:2)
# test pls, no variable selection
for (i in 1:10) {
calibrate <- age_only[-splits[[i]], ]
testing <- age_only[splits[[i]], ]
mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "structure_weight"],
ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
info = "Otolith Weight Prediction Model - No VIP",
x.test = testing[, 31:ncol(testing)],
y.test = testing[, "structure_weight"])
ncomp <- mod$ncomp.selected
# RMSE.age$pls.cal[i] <- mod$calres$rmse[[ncomp]]
RMSE.weight$pls.test[i] <- mod$testres$rmse[[ncomp]]
# r2.age$pls.cal[i] <- mod$calres$r2[[ncomp]]
r2.weight$pls.test[i] <- mod$testres$r2[[ncomp]]
mods.sw.pls[[i]] <- mod
# VIP < 1 excluded
vip <- as.data.frame(vipscores(mod))
mod <- pls(calibrate[, 31:ncol(calibrate)], calibrate[, "structure_weight"],
ncomp.selcrit = "wold", scale = F, center = T, cv = 1,
info = "Otolith Weight Prediction Model - VIP",
x.test = testing[, 31:ncol(testing)],
y.test = testing[, "structure_weight"],
exclcols = vip$V1 < 1
)
ncomp <- mod$ncomp.selected
# RMSE.age$vip.cal[i] <- mod$calres$rmse[[ncomp]]
RMSE.weight$vip.test[i] <- mod$testres$rmse[[ncomp]]
# r2.age$vip.cal[i] <- mod$calres$r2[[ncomp]]
r2.weight$vip.test[i] <- mod$testres$r2[[ncomp]]
mods.sw.vip[[i]] <- mod
rm(mod, ncomp, vip)
}
mod.sw.summary[1,2] <- round(mean(r2.weight$pls.test),3)
mod.sw.summary[1,3] <- round(mean(RMSE.weight$pls.test),7)
mod.sw.summary[2,2] <- round(mean(r2.weight$vip.test),3)
mod.sw.summary[2,3] <- round(mean(RMSE.weight$vip.test),7)
rm(r2.weight,RMSE.weight)
knitr::kable(mod.sw.summary, align = "c")
| model | r2 | RMSE |
|---|---|---|
| pls | 0.996 | 0.0003909 |
| pls_vip | 0.995 | 0.0004309 |